Advanced Practice in ML
Handling data labels
Label multiplicity
To obtain enough labeled data, companies often use multiple data sources and annotators. This leads to the problem of label multiplicity (also called label ambiguity): what to do when there are multiple conflicting labels for the same data instance.
Data lineage
= a technique that keeps track of the origin of each data sample as well as of its labels
Handling imbalanced datasets
An imbalanced dataset makes standard classification metrics, such as accuracy, work poorly. There are several solutions:
- Choose other metrics for classification problems, such as precision, recall, and the F-score (see the sketch below)
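For instance, a classifier that almost always predicts the majority class can score high accuracy while missing half of the minority class. A minimal sketch with scikit-learn (the toy labels below are made up):

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical toy data: 8 negatives, 2 positives (80/20 imbalance).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # model misses half the positives

print(accuracy_score(y_true, y_pred))         # 0.9 -- looks great
print(classification_report(y_true, y_pred))  # recall for class 1 is only 0.5
```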
- Data-level methods: resampling (see the sketch after this list)
  - Undersampling: down-sample the larger set by randomly throwing away some data from that set
  - Oversampling: up-sample the smaller set
    - by making multiple copies of the data points in the smaller set (this can cause the model to overfit)
    - by creating synthetic data, for example with
      - the synthetic minority oversampling technique (SMOTE): use the existing data in the smaller set to create new data points that look like the existing ones, i.e. take the feature vectors of the minority class and generate synthetic points between real data points and their k-nearest neighbours
      - adaptive synthetic sampling methods (ADASYN)
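A minimal resampling sketch using the imbalanced-learn library (the dataset and class weights are illustrative):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Illustrative imbalanced dataset: ~90% majority class, ~10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))

# Undersampling: randomly throw away majority-class samples.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_under))

# SMOTE: synthesize minority points between real points and
# their k nearest minority-class neighbours.
X_smote, y_smote = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_smote))
```

Note that only the training split should be resampled; evaluation should run on data with the original class distribution.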
- Algorithm-level methods: keep the training data distribution intact but alter the algorithm to make it more robust to class imbalance
  - cost-sensitive learning: weight the loss so that errors on the minority class cost more
  - class-balanced loss: weight each class in the loss by the inverse of its number of samples
  - focal loss: down-weight easy, well-classified examples so training focuses on the hard ones (see the sketch below)
  - ensemble learning methods also help, because each model in the ensemble can be trained on a different subset of the data
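As an illustration, here is a minimal binary focal loss in PyTorch, following the common formulation alpha_t * (1 - p_t)^gamma * cross-entropy; gamma=2.0 and alpha=0.25 are the widely used defaults:

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # Per-example cross-entropy, kept unreduced so we can reweight it.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)             # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma shrinks the loss of well-classified examples.
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()

logits = torch.tensor([2.0, -1.0, 0.5])   # raw model outputs (illustrative)
targets = torch.tensor([1.0, 0.0, 1.0])   # ground-truth labels
print(binary_focal_loss(logits, targets))
```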
- Use metrics to measure the imbalance itself (see the sketch after this list)
- Class imbalance: the imbalance in the number of members between different facet values
- Difference in proportions of labels (DPL): the imbalance of positive outcomes between different facet values
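Both metrics reduce to simple proportions; a minimal sketch (the facet and label arrays are hypothetical):

```python
import numpy as np

# Hypothetical binary facet (e.g. two demographic groups) and binary labels.
facet = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])  # 0 = facet a, 1 = facet d
label = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0])  # 1 = positive outcome

n_a, n_d = (facet == 0).sum(), (facet == 1).sum()
ci = (n_a - n_d) / (n_a + n_d)  # class imbalance
print(ci)                       # 0.2 -> facet a has 20% more members

q_a = label[facet == 0].mean()  # positive-outcome rate for facet a
q_d = label[facet == 1].mean()  # positive-outcome rate for facet d
print(q_a - q_d)                # DPL = 0.25 -> facet a gets positives more often
```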
Feature generalization
Always consider two aspects with regard to generalization:
- feature coverage: the percentage of samples that have values for this feature in the data -> the fewer values that are missing, the higher the coverage (see the sketch below)
- the distribution of feature values: the values seen in the training data should overlap with the values expected at serving time
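Coverage is easy to check with pandas; a minimal sketch (the DataFrame below is hypothetical):

```python
import pandas as pd

# Hypothetical dataset with missing values.
df = pd.DataFrame({
    "age":    [34, 41, None, 29, None],
    "income": [52_000, None, None, None, 61_000],
})

# Coverage = fraction of samples with a non-missing value per feature.
coverage = df.notna().mean()
print(coverage)  # age: 0.6, income: 0.4 -> 'income' generalizes poorly
```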
Model selection
- Always consider the interpretability and the complexity of the model.
Combining models
We can get an additional performance gain by combining strong models (base models) trained with different learning algorithms.
- Averaging
  - = average the predictions of the base models - works for regression
- Majority vote
  - = take the majority vote of the base models - works for classification
- Stacking
  - = build a meta-model that takes the outputs of the base models as input (see the sketch below)
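Majority vote and stacking both have off-the-shelf implementations in scikit-learn; a minimal sketch (the dataset and base models are illustrative). For regression, VotingRegressor averages the base models' predictions in the same spirit:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base models built with different learning algorithms.
base_models = [
    ("rf", RandomForestClassifier(random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]

# Majority vote: each base model casts one vote per prediction.
voter = VotingClassifier(estimators=base_models, voting="hard")
voter.fit(X_train, y_train)

# Stacking: a logistic-regression meta-model learns from the base models' outputs.
stacker = StackingClassifier(estimators=base_models,
                             final_estimator=LogisticRegression())
stacker.fit(X_train, y_train)

print(voter.score(X_test, y_test), stacker.score(X_test, y_test))
```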